This guide was written around the release of LLVM version 21.0.0.
Building clang and LLVM
Dependencies (because why not?)
| Package | Version | Notes |
|---|---|---|
| CMake | >=3.20.0 | Makefile/workspace generator |
| python | >=3.8 | Automated test suite |
| zlib | >=1.2.3.4 | Compression library |
| GNU Make | 3.79, 3.79.1 | Makefile/build processor |
| PyYAML | >=5.1 | Header generator |
Building clang (trust me you’ll need it)
Clang is an LLVM-based compiler driver: it provides the frontend and invokes the right tools for the LLVM backend to compile C-like languages.
git clone https://github.com/llvm/llvm-project.git
cmake -DLLVM_ENABLE_PROJECTS=clang -GNinja -DCMAKE_BUILD_TYPE=Release llvm
ninja clang
How do I know which project names to pass, you may ask? Check out the llvm-project/llvm/CMakeLists.txt file, search for LLVM_ALL_PROJECTS, and read the comments above it.
Alternatively, the instructions are also available here: https://clang.llvm.org/get_started.html
Building LLVM (yeah the dragon itself)
mkdir build
cd build
cmake -G Ninja \
-DCMAKE_BUILD_TYPE=Debug \
-DLLVM_ENABLE_PROJECTS="all" \
-DLLVM_OPTIMIZED_TABLEGEN=1 \
../llvm
ninja # this will build all the targets
Notably, some major targets you'll find inside build/bin are:
- `opt`: driver for performing optimizations on LLVM IR, i.e. LLVM IR => optimized LLVM IR
- `llc`: driver for transforming LLVM IR into assembly or object files
- `llvm-mc`: driver for interacting with machine code, i.e. assembling and disassembling objects
- `check`: target to run the tests
Let's quickly run the `check` target with ninja; it'll automatically run the test suite:
ninja check
# or run a specific target
ninja check-<target-name>
# print all available targets with
ninja help
LLVM also has the `llvm-lit` tool, which runs the specified tests:
./bin/llvm-lit test/CodeGen/RISCV/GlobalISel
# Output
-- Testing: 403 tests, 96 workers --
PASS: LLVM :: CodeGen/RISCV/GlobalISel/regbankselect/fp-arith-f16.mir (1 of 403)
PASS: LLVM :: CodeGen/RISCV/GlobalISel/legalizer/legalize-bswap-rv64.mir (2 of 403)
PASS: LLVM :: CodeGen/RISCV/GlobalISel/irtranslator/calls.ll (3 of 403)
[...]
There is also a tool inside LLVM known as `FileCheck`, which governs the verification of test outputs. I won't go in depth, but the idea is that you write the expected outputs within comments inside the IR (or any other file), and `lit`, when running the test, invokes `FileCheck` for verification. Here is a small example:
; RUN: opt < %s -passes=mem2reg | FileCheck %s
define i32 @example() {
; CHECK-LABEL: @example(
; CHECK-NOT: alloca
; CHECK: ret i32 42
%x = alloca i32
store i32 42, ptr %x
%result = load i32, ptr %x
ret i32 %result
}
In this case, the mem2reg pass should remove the "alloca" instruction and make the function return 42 directly; FileCheck verifies that the optimized IR has the label "@example", contains no "alloca", and returns 42 as an i32.
We can also quickly verify this on our own. Create a test.ll file with the exact same content shown above; since we have already built LLVM, we have the required tools. Run the following commands:
# This will output the optimized IR (you can verify the pattern by yourself)
./bin/opt -S < ../../tweaks/test.ll -passes=mem2reg
# Here we are using the generated optimized IR as stdin to the FileCheck
./bin/opt -S < ../../tweaks/test.ll -passes=mem2reg | ./bin/FileCheck ../../tweaks/test.ll # no output mean success
- `./bin/opt < ../../tweaks/test.ll -passes=mem2reg` will generate bitcode by default; use `-S` to get the textual representation
- `-passes=mem2reg` just tells the `opt` driver which optimization pass to apply
Tools like `lit` and `FileCheck` are built by the LLVM folks, but they are general enough to be used by other projects and languages as well. Semicolon (`;`) comments are specific to LLVM IR, but these tools can work with any language and its comment style.
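As a sketch of how these tools generalize beyond LLVM IR, a C file can carry its own FileCheck directives in `//`-comments (the RUN line and the `cc` invocation here are hypothetical, assuming `FileCheck` is on your PATH):

```c
#include <assert.h>

// RUN: cc -S -o - %s | FileCheck %s
// CHECK-LABEL: answer
// CHECK: ret
int answer(void) { return 42; }
```

lit substitutes `%s` with the file path, runs the RUN line as a shell command, and FileCheck matches the CHECK patterns against the generated assembly.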
Read more about them and usage:
You should be thinking of a question at the moment (at least I had): optimizations can reorder instructions, so how would these checks stay valid in that case? Well, short answer: go to the docs and read about CHECK-DAG.
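As a minimal sketch (hypothetical globals `@a` and `@b`): a run of adjacent CHECK-DAG directives matches in any order, so the test stays valid even if a pass reorders the two stores:

```llvm
; Either ordering of the two stores satisfies these checks:
; CHECK-DAG: store i32 1, ptr @a
; CHECK-DAG: store i32 2, ptr @b
```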
LLVM tests are located inside the `build` directory as:
- `unittests`: typical tests written using the gtest suite
- `test`: the `lit` tests; we already saw an example of a typical lit test in llvm above
LLVM-Project directory tree
- The high-level codebase division is organized among projects: mlir, clang, lldb, openmp, etc., which we also pass to `-DLLVM_ENABLE_PROJECTS` to specify the targets during build
- Each project is further organized as:
  - `lib`: contains all the libraries, e.g. for llvm: CodeGen, Analysis, Linker, etc.
  - `include/<project>`: the exposed public headers; here you can see a folder for each library, just as in `lib`
  - `tools`: contains project-specific tools, e.g. for llvm: xcode-toolchain, llc, etc.
  - `unittests`: gtest tests
  - `test`: llvm-lit tests
  - `utils`: utility tools like `FileCheck` and `llvm-lit`
Note: for any library inside the `lib` folder, you can find the corresponding `include/<project>/<lib>`, `unittests/<lib>`, and `test/<lib>` directories.
- To gather the public and private headers for some `<project>/<lib>`:
  - Public headers: `<project>/include/<project>/<lib>/<file>`
  - Private headers: `<project>/lib/<lib>/<file>`
- The paths inside a code file can also be identified in the `#include` directive as:
  - Public: `#include "<project>/<lib>/<file>"`
  - Private: `#include "<file>"`
- `llvm` project specifics:
  - LLVM-IR
    - It is mainly found inside `lib/IR`
    - Its optimization specifics are available in `lib/Analysis` and `lib/Transforms`
    - Binary and textual representations are handled in `lib/Bitcode`, `lib/IRReader`, and `lib/IRPrinter`
  - General backend code generation is in `lib/CodeGen`
  - Target-specific backend code generation: `lib/Target/<backend>`
Some Important Compiler Lingo
Build Time:
Time taken to build the compiler; the time it takes for make or ninja to complete the build of LLVM
Compile Time:
Time taken by the compiler to convert a source file into an object file
Runtime:
Time taken to execute the binary, i.e. the time taken to run the binary file produced by clang
The runtime of the compiler is basically the compile time of the application
Canonical form
A recommended way of representing expressions or a piece of code.
a = b + 2;
// OR
a = 2 + b;
Agreeing on a canonical form means that the compiler will strive to generate expressions in only one way; sometimes any other form can impact performance.
Middle-End
This stage is often lumped in with the backend, but to be specific: it comprises the target-agnostic transformations of the IR.
Backend
Target-Specific transformations on the IR
Three levels in LLVM
- LLVM IR (High-level, target-independent)
- Machine IR (Low-Level, target-specific, still IR)
- Assembly/Machine Code (Final output)
LLVM IR Machine IR Assembly
------- ---------- --------
define i32 @add → MachineFunction → add:
%result = add → ADD32 %vreg1 → mov eax, edi
ret i32 %result → RET %vreg1 → add eax, esi
→ ret
Application Binary Interface (ABI)
It can be tricky to understand at first, so think of it as a protocol that defines exactly how functions communicate with each other at the machine-code level. Consider these two functions:
int add(int a, int b) {
return a + b;
}
int main() {
int result = add(5, 10);
return result;
}
When main
calls add
several low-level questions need answering:
- Where does `main` put the values 5 and 10 so that `add` can find them?
- Where does `add` put the result 15 so that `main` can retrieve it?
- How is the function call actually made at the assembly level?
An ABI provides these specific answers. For example, on x86-64 a common ABI might specify:
- First int argument goes in the RDI register
- Second int argument goes in the RSI register
- Return value goes in the RAX register
- Stack must be 16-byte aligned before calls
mov rdi, 5
mov rsi, 10
call add
- ABI \(\neq\) assembly: each target has its own ABIs, and assembly instructions are used to implement them
- You must define your own ABI when writing your own backend; for interoperability, it is recommended to use an existing one if available
- One target can have multiple possible ABIs, but for a particular calling convention there is one matching ABI
- If different compilers follow the same ABI rules, then a function compiled by one can interact with a function compiled by the other
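For instance, under the assumed rules above, the callee side of `add` might look like this (a sketch, not the output of any specific compiler):

```
add:
    lea eax, [rdi + rsi]   ; result = a + b, returned in EAX
    ret
```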
Encoding
Encoding is the specification of how assembly instructions translate into the actual 1s and 0s that the CPU reads. Let's say we want to encode add eax, ebx
The x86 encoding might be:
Binary: 00000001 11011000
Hex: 01 D8
The ISA document of a particular target tells you exactly what each bit means:
First Byte: 01 (Opcode)
01
= ADD instruction with format “r/m32, r32” (meaning: destination=r/m, source=reg)
Second Byte: D8 (ModR/M byte)
Binary: 11011000 Breaking down the ModR/M byte:
- Bits 7-6 (MOD): 11 = both operands are registers (not memory)
- Bits 5-3 (REG): 011 = EBX register (source)
- Bits 2-0 (R/M): 000 = EAX register (destination)
So 01 D8
means: “ADD the value in EBX (reg field) to EAX (r/m field)”
Misc
Virtual and Physical registers
Physical registers are the actual registers that exist in your CPU:
x86-64 has: RAX, RBX, RCX, RDX, RSI, RDI, R8-R15, etc.
ARM has: R0-R15, etc.
They are limited in number (x86-64 has ~16 general-purpose registers) and they have specific names.
On the other hand, virtual registers are temporary names created by the compiler:
%vreg0, %vreg1, %vreg2, %vreg3, %vreg4, %vreg5, ...
They are unlimited, don't correspond to real hardware, and are generically numbered. Machine IR uses virtual registers; in later compilation steps they get mapped to physical registers.
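As a rough sketch (pseudocode, not real Machine IR syntax), register allocation performs a rewrite like:

```
; before register allocation: unlimited virtual registers
%vreg0 = ADD %vreg1, %vreg2
; after register allocation: only physical registers remain
eax = ADD eax, esi
```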
Structures & Relations within LLVM programs
A typical program is divided into the following structure in LLVM processing; top-down manner:
- Modules
- Functions
- Basic Block
- Instruction
- Definition
- Arguments / operands
- operation code / opcode
Following are the ways these components or structures are connected:
- Control Flow Graphs (CFG)
Modules:
A module is essentially the container for everything the compiler is working on at a given time. Think of it as:
Input File (e.g., main.c) → LLVM Module
The module contains:
- All function definitions (main(), foo(), etc.)
- Global variables
- Metadata
- Everything else needed to compile that input
It is also known as a Translation Unit or Compilation Unit; all refer to the same concept - the unit of compilation.
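As a sketch, here is a tiny hand-written module holding one global and one function (hypothetical names), annotated with the containers from the list above:

```llvm
; the whole file is one module
@counter = global i32 0            ; a global variable

define i32 @main() {               ; a function definition
entry:                             ; a basic block
  %v = load i32, ptr @counter      ; an instruction: opcode + operands
  ret i32 %v
}
```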